This is my personal project page for the University of Helsinki statistics course named “Introduction to Open Data Science”.
My name is Pinja-Liina Jalkanen.
Please be forewarned that my statistics skills are rather crappy and usually limited to some simple descriptives and regressions. I do know some tech stuff quite well, though, so I don’t expect great difficulties with the R syntax, Git or the like. Kimmo Vehkalahti said in the first lecture of the course that data analysis is often 90% preparation of the data and 10% doing the analysis. I don’t know about the weighting on this course, but I usually don’t need help with the 90% at all; that’s mostly RTFM, STFW and the like anyway.
With regard to the 10% though, I don’t trust my skills at all.
Link to this project: https://github.com/pinjaliina/IODS-project
The data of the analysis in this exercise was the Autumn 2014 Introductory Statistics Course Learning Questionnaire. The data has 60 variables and 183 observations; except for a few background variables (age, gender, attitude towards statistics, course points), most of the questions were learning-related, with answers given on a Likert scale from 1 to 5.
For the analysis, I’ve averaged the variables related to deep learning, surface learning, and strategic learning. (For the initial data wrangling part of this exercise, see this R script).
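The averaging itself happens in the wrangling script; a minimal sketch of the idea, with hypothetical question-name vectors (the real column groupings are defined in the linked script):

```r
# Hypothetical illustration of the averaging step; the actual question
# columns are picked in the linked wrangling script.
deep_questions <- c("D03", "D06", "D07", "D11")   # assumed names
lrn$deep <- rowMeans(lrn[, deep_questions])       # mean of the Likert answers
```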
The data is read in as follows:
# Read in the data
lrn2014a <- as.data.frame(read.table('data/learning2014.csv', sep="\t", header=TRUE))
A scatter plot matrix of the variables of the data can be drawn as follows, coloured by the gender variable:
p <- ggpairs(lrn2014a, mapping = aes(col = gender), lower = list(combo = wrap("facethist", bins = 20)))
p
A summary of each of the variables of the data can be displayed as follows:
summary(lrn2014a)
## age attitude points gender
## Min. :17.00 Min. :14.00 Min. : 7.00 F:110
## 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00 M: 56
## Median :22.00 Median :32.00 Median :23.00
## Mean :25.51 Mean :31.43 Mean :22.72
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75
## Max. :55.00 Max. :50.00 Max. :33.00
## deep stra surf
## Min. :1.625 Min. :1.583 Min. :1.250
## 1st Qu.:3.500 1st Qu.:2.417 1st Qu.:2.625
## Median :3.875 Median :2.833 Median :3.188
## Mean :3.796 Mean :2.787 Mean :3.121
## 3rd Qu.:4.250 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.875 Max. :4.333 Max. :5.000
As can be seen from the output, most of the variables – with the exception of the students’ age – are distributed fairly evenly and correlate only weakly with the course points. The most notable exception is the attitude towards statistics, which correlates more strongly with the course points than any other variable. Except for age and points, the distributions of most of the variables seem reasonably close to normal. The gender variable shows that considerably more female than male students attended the course.
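These correlation observations can also be checked numerically; a quick sketch, dropping the gender factor first:

```r
# Correlations of the numeric variables with the course points;
# gender is a factor and must be excluded from cor().
num_cols <- sapply(lrn2014a, is.numeric)
round(cor(lrn2014a[, num_cols])[, "points"], 2)
```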
A linear regression model can be fit to the data as follows:
lrn2014_model <- lm(points ~ attitude + stra + surf, data = lrn2014a)
The chosen explanatory variables for the model were attitude, strategic learning and surface learning. The summary of the model can be printed as follows:
summary(lrn2014_model)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = lrn2014a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.01711 3.68375 2.991 0.00322 **
## attitude 0.33952 0.05741 5.913 1.93e-08 ***
## stra -0.58607 0.80138 -0.731 0.46563
## surf 0.85313 0.54159 1.575 0.11716
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
However, as can be seen from the model summary, the estimates of strategic and surface learning are not statistically significant in explaining the course points; given the weak correlation of those variables with the points, this was somewhat expected. It thus makes sense to eliminate at least the variable with the highest p-value, strategic learning:
lrn2014_model <- lm(points ~ attitude + surf, data = lrn2014a)
summary(lrn2014_model)
##
## Call:
## lm(formula = points ~ attitude + surf, data = lrn2014a)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.97290 2.39591 3.745 0.00025 ***
## attitude 0.34658 0.05652 6.132 6.31e-09 ***
## surf 0.91365 0.53447 1.709 0.08927 .
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
With the removal of that variable, the statistical significance of the surface learning estimate improves somewhat, to a nearly significant level, so it can be kept in the model. In practice its effect on the model is so small that it could almost as well be left out, but because it is reasonably close to statistical significance and also gives the highest adjusted R2 – the adjusted R2 of a model with attitude as the only explanatory variable, as reported by summary(), would be 0.1856 – its inclusion in the model is justified.
While the model estimates are statistically significant, the model as a whole is not very good: the multiple R2 value is only 0.20, which means that about 80 % of the variance of the dependent variable – the course points – remains unexplained by the explanatory variables. Thus, any predictions based on the model alone might not be very reliable.
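The adjusted R2 figures quoted above can be extracted directly from the model summaries, e.g.:

```r
# Compare the adjusted R-squared of the chosen model with that of a
# model using attitude alone.
m_attitude_only <- lm(points ~ attitude, data = lrn2014a)
summary(m_attitude_only)$adj.r.squared  # reported above as 0.1856
summary(lrn2014_model)$adj.r.squared    # 0.1951
```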
In addition to linearity, linear models are fitted with the assumption that the errors are normally distributed, have a constant variance and are independent of each other and of the explanatory variables.
The validity of these assumptions can be tested by analysing the residuals of the model, with the help of different kinds of diagnostic plots. In the following figure, three such plots are drawn:
par(mfrow = c(1,3)) # Set some graphical params.
# It seems that the drawing order of the plot is independent of the vector
# order and can't be changed.
plot(lrn2014_model, which = c(1,2,5))
Interpretation of the plots: the Residuals vs Fitted plot helps to assess the constant variance assumption, the Normal Q-Q plot the normality of the errors, and the Residuals vs Leverage plot the presence of unusually influential observations.
The data of the analysis in this exercise was Fabio Pagnotta’s and Hossain Mohammad Amran’s Using Data Mining To Predict Secondary School Student Alcohol Consumption (2008), published by the Department of Computer Science of the University of Camerino (link: https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION). For the analysis, I’ve also created a combined total alcohol consumption variable (the average of weekday and weekend consumption) and a separate logical high use variable based on the total consumption. (For the initial data wrangling part of this exercise, see this R script.)
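A minimal sketch of the derived variables (the high-use cutoff of 2 is an assumption here; the actual definitions are in the linked wrangling script):

```r
# alc_use: mean of workday (Dalc) and weekend (Walc) consumption.
alc$alc_use  <- (alc$Dalc + alc$Walc) / 2
# high_use: TRUE when the combined consumption exceeds the assumed cutoff.
alc$high_use <- alc$alc_use > 2
```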
The data is read in as follows:
# Read in the data
alc <- as.data.frame(read.table('data/alc.csv', sep="\t", header=TRUE))
A glimpse of all of the variables of the data can be displayed as follows:
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fctr> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G...
## $ sex <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15...
## $ address <fctr> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fctr> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, ...
## $ Pstatus <fctr> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4...
## $ Mjob <fctr> at_home, at_home, at_home, health, other, ser...
## $ Fjob <fctr> teacher, other, other, services, other, other...
## $ reason <fctr> course, course, other, home, home, reputation...
## $ nursery <fctr> yes, no, yes, yes, yes, yes, yes, yes, yes, y...
## $ internet <fctr> no, yes, yes, yes, no, yes, yes, no, yes, yes...
## $ guardian <fctr> mother, father, mother, mother, father, mothe...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1...
## $ failures <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup <fctr> yes, no, yes, no, no, no, no, yes, no, no, no...
## $ famsup <fctr> no, yes, no, yes, yes, yes, no, yes, yes, yes...
## $ paid <fctr> no, no, yes, yes, yes, yes, no, no, yes, yes,...
## $ activities <fctr> no, no, no, yes, no, yes, no, no, no, yes, no...
## $ higher <fctr> yes, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ romantic <fctr> no, no, no, yes, no, no, no, no, no, no, no, ...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2...
## $ absences <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0,...
## $ G1 <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14,...
## $ G2 <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14,...
## $ G3 <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE...
As can be seen from the above variable list, there are both numerical and factor background variables. The target variable for this analysis is the binary high/low alcohol consumption variable (high_use). To analyse it, I’ve chosen four variables that I assume are indicative of students’ alcohol consumption: the number of school absences (absences), going out with friends (goout), the final grade (G3) and weekly study time (studytime).
A summary and some plots of the chosen variables are shown below (the boxplots’ whiskers extend to at most 1.5 times the interquartile range, the ggplot2 default). I also grouped the box plots by sex to see any potential differences between the sexes:
summary(alc[c('absences','goout','G3','studytime')])
## absences goout G3 studytime
## Min. : 0.000 Min. :1.000 Min. : 0.00 Min. :1.000
## 1st Qu.: 0.000 1st Qu.:2.000 1st Qu.: 8.00 1st Qu.:1.000
## Median : 3.000 Median :3.000 Median :11.00 Median :2.000
## Mean : 5.319 Mean :3.113 Mean :10.39 Mean :2.034
## 3rd Qu.: 8.000 3rd Qu.:4.000 3rd Qu.:14.00 3rd Qu.:2.000
## Max. :75.000 Max. :5.000 Max. :20.00 Max. :4.000
p1 <- ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + xlab('high use')
p2 <- ggplot(alc, aes(goout)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count") + xlab('going out')
p3 <- ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab('final grade') + xlab('high use')
p4 <- ggplot(alc, aes(studytime)) + geom_bar(aes(fill = high_use), position = "dodge", stat="count")
# The multiplot() is defined in the init section of this file. For details, see
# http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
multiplot(p1, p2, p3, p4, cols = 2)
Based on the above exploration of the variables, it looks like all of my hypotheses held to at least some extent, with the possible exception of the final grade. To confirm this, I performed a logistic regression analysis using the chosen variables as the explanatory variables.
In the following, a model is fitted to the data and summarised.
m <- glm(high_use ~ absences + goout + G3 + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + goout + G3 + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8166 -0.7495 -0.5034 0.8071 2.5083
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.62195 0.62727 -4.180 2.92e-05 ***
## absences 0.05600 0.01643 3.407 0.000656 ***
## goout 0.74536 0.12013 6.204 5.49e-10 ***
## G3 0.01269 0.02848 0.446 0.655854
## studytime -0.60004 0.16852 -3.561 0.000370 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 383.16 on 377 degrees of freedom
## AIC: 393.16
##
## Number of Fisher Scoring iterations: 4
Judging by the model summary, my hypothesis was wrong with regard to the final grade, which was not a good predictor of high alcohol consumption at all: its estimate has no statistical significance whatsoever. All the other chosen explanatory variables are relatively strong predictors. Calculating the odds ratios points in the same direction:
or <- exp(coef(m))
or
## (Intercept) absences goout G3 studytime
## 0.07266094 1.05759534 2.10719756 1.01277150 0.54879019
As shown by the odds ratios, a student’s odds of high alcohol consumption are virtually unchanged by the grade. Interestingly, absences are not much of a factor either: each additional absence multiplies the odds of high consumption by only about 1.06. With regard to going out and study time the results are, however, very clear: each step up in going out roughly doubles the odds of high consumption, and each step up in study time roughly halves them. But while absences don’t raise the odds much, comparing the confidence interval of their odds ratio to the value one still confirms their statistical significance:
ci <- exp(confint(m))
cbind(or, ci)
## or 2.5 % 97.5 %
## (Intercept) 0.07266094 0.02042166 0.2402599
## absences 1.05759534 1.02602413 1.0954317
## goout 2.10719756 1.67554867 2.6863511
## G3 1.01277150 0.95846241 1.0720167
## studytime 0.54879019 0.38995929 0.7564410
As shown by the confidence intervals of the odds ratios, the intervals of all the explanatory variables except that of the final grade (G3) steer well clear of one, which means their effects are statistically significant. With regard to the final grade, however, my initial hypothesis was clearly wrong, as the value one lies almost in the middle of its confidence interval. Thus, before making any actual predictions with the model, it’s best to refit it without that variable:
m <- glm(high_use ~ absences + goout + studytime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ absences + goout + studytime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8376 -0.7530 -0.4927 0.8091 2.4967
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.47603 0.53168 -4.657 3.21e-06 ***
## absences 0.05627 0.01651 3.409 0.000651 ***
## goout 0.73711 0.11835 6.228 4.72e-10 ***
## studytime -0.59422 0.16800 -3.537 0.000405 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 383.36 on 378 degrees of freedom
## AIC: 391.36
##
## Number of Fisher Scoring iterations: 4
cbind(exp(coef(m)), exp(confint(m)))
## 2.5 % 97.5 %
## (Intercept) 0.08407616 0.02874986 0.2322992
## absences 1.05788483 1.02622976 1.0958710
## goout 2.08988646 1.66701101 2.6539672
## studytime 0.55199236 0.39259400 0.7600378
As shown by the above statistics, the remaining explanatory variables are now all statistically highly significant and have confidence intervals well clear of the value one, so the model is now ready to be used for predictions.
The model can be used for predictions as follows:
# Predict the probability.
probabilities <- predict(m, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability = probabilities)
# Calculate a logical high use value based on probabilities.
alc <- mutate(alc, prediction = probability > 0.5)
# Tabulate the target variable versus the predictions,
# with both absolute and proportional numbers.
tbl <- table(high_use = alc$high_use, prediction = alc$prediction)
addmargins(tbl)
## prediction
## high_use FALSE TRUE Sum
## FALSE 246 24 270
## TRUE 66 46 112
## Sum 312 70 382
round(addmargins(prop.table(tbl)), 2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.64 0.06 0.71
## TRUE 0.17 0.12 0.29
## Sum 0.82 0.18 1.00
As the tables show, the model is too cautious in its predictions: it predicts fewer occurrences of high use than the actual data contains. This can also be demonstrated graphically:
hu <- as.data.frame(prop.table(table(alc$high_use)))
pred <- as.data.frame(prop.table(table(alc$prediction)))
pp1 <- ggplot(hu, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('observed high use') + theme(legend.position = 'none')
pp2 <- ggplot(pred, aes(Var1, Freq)) + geom_col(aes(fill = Var1)) + scale_y_continuous(limits = 0:1) + ylab('frequency') + xlab('predicted high use') + theme(legend.position = 'none')
multiplot(pp1, pp2, cols = 2)
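The confusion matrix above also yields the model’s sensitivity and specificity, which quantify this cautiousness:

```r
# Sensitivity (true positive rate) and specificity (true negative rate)
# computed from the 2x2 table tbl above.
tbl["TRUE", "TRUE"] / sum(tbl["TRUE", ])     # sensitivity: 46/112, about 0.41
tbl["FALSE", "FALSE"] / sum(tbl["FALSE", ])  # specificity: 246/270, about 0.91
```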
The actual model training error can be calculated as follows (note that this is a function only because one is needed later on for cv.glm()):
mloss <- function(obs, prob) {
res <- ifelse(prob > 0.5, 1, 0)
mean(res != obs)
}
round(mloss(obs = alc$high_use, prob = alc$probability), 2)
## [1] 0.24
The training error is 24%, so the model has roughly 76% accuracy. This isn’t perfect, but it is still better than a simple guessing strategy would achieve.
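As a point of comparison, the simplest guessing strategy – always predicting that a student is not a high user – can be evaluated directly from the data:

```r
# Accuracy of always predicting FALSE: the share of non-high-users.
1 - mean(alc$high_use)  # 270/382, about 0.71 — worse than the model's 0.76
```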
To test the model further, cross-validation can be performed. The following performs a 10-fold cross-validation:
cv <- cv.glm(data = alc, cost = mloss, glmfit = m, K = 10)
cv$delta[1] # Print the average proportion of wrong predictions.
## [1] 0.2382199
The average prediction error of the 10-fold cross-validation is 0.24, which is already better performance than that of the model introduced in the DataCamp exercise. The model could still be improved by adding more variables: for instance, it does not include sex, even though according to the DataCamp exercise sex is a statistically significant predictor. Thus, we can try adding it and re-run the cross-validation:
m2 <- glm(high_use ~ absences + goout + studytime + sex, data = alc, family = "binomial")
summary(m2) # Summarise the model.
##
## Call:
## glm(formula = high_use ~ absences + goout + studytime + sex,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7586 -0.7653 -0.4897 0.7254 2.5996
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.23486 0.59934 -5.397 6.76e-08 ***
## absences 0.06269 0.01687 3.716 0.000202 ***
## goout 0.74151 0.12061 6.148 7.86e-10 ***
## studytime -0.45081 0.17155 -2.628 0.008594 **
## sexM 0.81167 0.26937 3.013 0.002585 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 374.08 on 377 degrees of freedom
## AIC: 384.08
##
## Number of Fisher Scoring iterations: 4
cbind(exp(coef(m2)), exp(confint(m2))) # Compute odds ratios and their confidence intervals.
## 2.5 % 97.5 %
## (Intercept) 0.0393658 0.01172067 0.1236057
## absences 1.0646959 1.03217181 1.1038908
## goout 2.0990949 1.66738691 2.6783765
## studytime 0.6371102 0.45038109 0.8845879
## sexM 2.2516753 1.33375527 3.8439285
# Predict the probability.
probabilities2 <- predict(m2, type = "response")
# Add the probabilities to alc.
alc <- mutate(alc, probability2 = probabilities2)
# Calculate a logical high use value based on probabilities.
alc <- mutate(alc, prediction2 = probability2 > 0.5)
# Re-run cross-validation and print the average proportion of wrong predictions.
cv2 <- cv.glm(data = alc, cost = mloss, glmfit = m2, K = 10)
cv2$delta[1]
## [1] 0.2094241
The average prediction error is now only 0.21, thus clearly lower than in the previous model.
The data of this classification and clustering exercise was the Housing Values in Suburbs of Boston dataset, henceforth referred to simply as Boston. It is available from the R package MASS, which provides functions and datasets to support Venables and Ripley, Modern Applied Statistics with S (4th edition, 2002). According to the reference manual of the package, the dataset includes 506 observations of 14 variables.
According to the reference manual, the data is based on Harrison, D. and Rubinfeld, D.L. (1978), Hedonic prices and the demand for clean air, and on Belsley, Kuh and Welsch (1980), Regression Diagnostics.
After the MASS package has been loaded – by calling library(MASS) – its built-in datasets can be made available simply by passing their names to data(). They are, however, lazy-loaded, and thus only become actual data frames when accessed for the first time (see help("data") for details):
# Prepare the data. (Quotes are optional but recommended; see help("data").)
data("Boston")
Glimpse the data to confirm it matches the reference manual:
# The data is only turned into an actual data frame at this point.
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, ...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, ...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, ...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 1...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 1...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 3...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15,...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 1...
Explore the data graphically:
# Subplot axis labels are partially too cramped, but I failed to find a working solution for that.
p <- ggpairs(Boston, mapping = aes(), lower = list(combo = wrap("facethist", bins = 10)), upper = list(continuous = wrap("cor", size=3)))
p
Show variable summaries:
summary(Boston)
## crim zn indus
## Min. : 0.00632 Min. : 0.00 Min. : 0.46
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19
## Median : 0.25651 Median : 0.00 Median : 9.69
## Mean : 3.61352 Mean : 11.36 Mean :11.14
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10
## Max. :88.97620 Max. :100.00 Max. :27.74
## chas nox rm age
## Min. :0.00000 Min. :0.3850 Min. :3.561 Min. : 2.90
## 1st Qu.:0.00000 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02
## Median :0.00000 Median :0.5380 Median :6.208 Median : 77.50
## Mean :0.06917 Mean :0.5547 Mean :6.285 Mean : 68.57
## 3rd Qu.:0.00000 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08
## Max. :1.00000 Max. :0.8710 Max. :8.780 Max. :100.00
## dis rad tax ptratio
## Min. : 1.130 Min. : 1.000 Min. :187.0 Min. :12.60
## 1st Qu.: 2.100 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40
## Median : 3.207 Median : 5.000 Median :330.0 Median :19.05
## Mean : 3.795 Mean : 9.549 Mean :408.2 Mean :18.46
## 3rd Qu.: 5.188 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20
## Max. :12.127 Max. :24.000 Max. :711.0 Max. :22.00
## black lstat medv
## Min. : 0.32 Min. : 1.73 Min. : 5.00
## 1st Qu.:375.38 1st Qu.: 6.95 1st Qu.:17.02
## Median :391.44 Median :11.36 Median :21.20
## Mean :356.67 Mean :12.65 Mean :22.53
## 3rd Qu.:396.23 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :396.90 Max. :37.97 Max. :50.00
As can be seen from the output, almost all of the variables are continuous, with the exception of the Charles River tract variable (chas), which is binary, and the radial highways accessibility variable (rad), which is an index but still measured on an interval level.
The distribution of most variables is rather skewed. Exceptions are the dwelling size (number of rooms; rm), which is normally distributed, and the median value of owner-occupied homes (medv), which is nearly so. Some variables are very strongly skewed, such as the crime rate (crim), the proportion of land zoned for very large lots (zn) and the proportion of black residents (black).
There are a lot of variables in the data, so it might be helpful to calculate a correlation matrix of the data and visualise it:
# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", tl.pos="d", tl.cex=0.6)
A numerical equivalent of a correlation plot including only the highest correlations might be created as follows. I personally found this to be the most intuitive way to find the highest correlations:
cb <- as.data.frame(cor(Boston)) # Create a DF of the correlation matrix.
cor_matrix_high <- as.data.frame(matrix(nrow = 14, ncol = 14)) #Copy
colnames(cor_matrix_high) <- colnames(cor_matrix) #the structure of
rownames(cor_matrix_high) <- rownames(cor_matrix) #cor_matrix.
cor_threshold <- 0.7
# Loop through the correlation matrix and save only values that exceed the threshold.
for(col in names(cb)) {
for(row in 1:length(cb[[col]])) {
if(abs(cb[[col,row]]) > cor_threshold & abs(cb[[col,row]]) < 1) {
cor_matrix_high[col,as.character(rownames(cb)[row])] <- round(cb[[col,row]], digits = 2)
}
}
}
# Print the matrix.
cor_matrix_high
## crim zn indus chas nox rm age dis rad tax ptratio
## crim NA NA NA NA NA NA NA NA NA NA NA
## zn NA NA NA NA NA NA NA NA NA NA NA
## indus NA NA NA NA 0.76 NA NA -0.71 NA 0.72 NA
## chas NA NA NA NA NA NA NA NA NA NA NA
## nox NA NA 0.76 NA NA NA 0.73 -0.77 NA NA NA
## rm NA NA NA NA NA NA NA NA NA NA NA
## age NA NA NA NA 0.73 NA NA -0.75 NA NA NA
## dis NA NA -0.71 NA -0.77 NA -0.75 NA NA NA NA
## rad NA NA NA NA NA NA NA NA NA 0.91 NA
## tax NA NA 0.72 NA NA NA NA NA 0.91 NA NA
## ptratio NA NA NA NA NA NA NA NA NA NA NA
## black NA NA NA NA NA NA NA NA NA NA NA
## lstat NA NA NA NA NA NA NA NA NA NA NA
## medv NA NA NA NA NA NA NA NA NA NA NA
## black lstat medv
## crim NA NA NA
## zn NA NA NA
## indus NA NA NA
## chas NA NA NA
## nox NA NA NA
## rm NA NA NA
## age NA NA NA
## dis NA NA NA
## rad NA NA NA
## tax NA NA NA
## ptratio NA NA NA
## black NA NA NA
## lstat NA NA -0.74
## medv NA -0.74 NA
Variables tax (property taxes) and rad have a remarkably high correlation with each other: 0.91. This might mean that the taxation is at least partially based on the highway accessibility. Other relatively high correlations include e.g. the negative correlation between the amount of nitrogen oxides (nox) and the distance to employment centres (dis) (-0.77), the positive correlation between industry presence (indus) and nox (0.76), and the negative correlation between the proportion of pre-1940s buildings (age) and dis (-0.75).
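As a side note, the same thresholded matrix could be produced without explicit loops, since ifelse() is vectorised over whole matrices; a sketch using the same 0.7 threshold:

```r
# Keep only correlations whose absolute value exceeds the threshold,
# excluding the diagonal (which is exactly 1).
cm <- round(cor(Boston), digits = 2)
as.data.frame(ifelse(abs(cm) > 0.7 & abs(cm) < 1, cm, NA))
```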
To further analyse the dataset, it must be standardised, i.e. every variable centred to a mean of zero and scaled to unit standard deviation. This can be done as follows:
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681
## Median :-0.2723 Median :-0.1441 Median :-0.1084
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515
## age dis rad
## Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
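As a sanity check, scale() does nothing more than centre each column by its mean and divide it by its standard deviation; this can be verified for a single variable:

```r
# Manual standardisation of crim should match the scale() result.
all.equal(boston_scaled$crim,
          (Boston$crim - mean(Boston$crim)) / sd(Boston$crim))
```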
We can now see from the output of summary() that this worked as intended. We also need to categorise our target variable – crim – in order to classify it:
# Create a quantile vector of crim, and use it to create the categorical "crime".
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c('low','med_low','med_high','high'))
# Replace the original unscaled variable.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
table(boston_scaled$crime) # Explore the categorised variable.
##
## low med_low med_high high
## 127 126 126 127
To create the LDA model and to test it, the data has to be divided into training and testing sets. This can be done as follows, choosing randomly 80% of the data to be used for training:
n <- nrow(boston_scaled) # Get number of rows in the dataset.
ind <- sample(n, size = n * 0.8) # Choose randomly 80% of the rows.
train <- boston_scaled[ind,] # Create train set.
test <- boston_scaled[-ind,] # Create test set.
# Save the correct classes from the test data.
correct_classes <- test$crime
# Remove the crime variable from the test data.
test <- dplyr::select(test, -crime)
With the data divided, it is now possible to fit the LDA model on the training set:
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2574257 0.2351485 0.2623762 0.2450495
##
## Group means:
## zn indus chas nox rm
## low 0.9310036 -0.8997841 -0.19661560 -0.8657659 0.39225629
## med_low -0.1207535 -0.2173649 -0.02367011 -0.5403276 -0.17901891
## med_high -0.3893509 0.1798569 0.21052285 0.4044867 0.09697913
## high -0.4872402 1.0171737 -0.07348562 1.0792632 -0.40302139
## age dis rad tax ptratio
## low -0.8783624 0.8938697 -0.6826072 -0.7189505 -0.43026262
## med_low -0.2498804 0.3086852 -0.5478716 -0.4335468 0.01373224
## med_high 0.4321568 -0.3845833 -0.3924693 -0.3093288 -0.32719705
## high 0.8306075 -0.8588203 1.6375616 1.5136504 0.78011702
## black lstat medv
## low 0.3734192 -0.74401589 0.46804882
## med_low 0.3039552 -0.08346247 -0.04934803
## med_high 0.0941871 0.04170050 0.16640034
## high -0.8108170 0.94270106 -0.73963535
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0955965246 0.70624698 -0.8239310
## indus -0.0358311928 -0.18625688 0.3297031
## chas -0.0856091798 -0.13405834 0.0568785
## nox 0.3437240310 -0.71040440 -1.3714271
## rm -0.0994877919 -0.09962215 -0.2056183
## age 0.3289055211 -0.36126888 0.0388215
## dis -0.0681938096 -0.28896012 0.2105911
## rad 3.0362894951 0.91854428 -0.2671302
## tax 0.0005978297 -0.02971497 0.7098499
## ptratio 0.0888520020 -0.02644323 -0.1454231
## black -0.1275380985 0.02144684 0.1211509
## lstat 0.2261932426 -0.22638028 0.3334836
## medv 0.1500272939 -0.41970832 -0.1130043
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9441 0.0420 0.0138
As we used quantiles to categorise the original variable, we have four classes. Accordingly, the output shows three linear discriminants, as expected. Of these, the first explains the vast majority – 94% – of the between-group variance.
The first two of the model’s linear discriminants can be visualised as follows. A helper function is needed to draw the arrows in the biplot:
# Define a function for the biplot arrows.
lda.arrows <- function(x, myscale = 2, arrow_heads = 0.2, color = "deeppink",
                       tex = 1, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[,choices[1]],
         y1 = myscale * heads[,choices[2]], col = color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime) # Turn the classes to numeric for plotting.
plot(lda.fit, dimen = 2, col = classes, pch = classes) # Plot.
lda.arrows(lda.fit) # Add arrows.
It’s possible to visualise all three discriminants, but the lda.arrows() function is incompatible with that:
plot(lda.fit, dimen = 3, col = classes, pch = classes) # Plot.
We’ve already prepared the test set above, so it’s now possible to move straight into predictions:
lda.pred <- predict(lda.fit, newdata = test) # Predict the test values.
# Cross tabulate the predictions with the correct values.
addmargins(table(correct = correct_classes, predicted = lda.pred$class))
## predicted
## correct low med_low med_high high Sum
## low 17 6 0 0 23
## med_low 11 15 5 0 31
## med_high 1 6 13 0 20
## high 0 0 0 28 28
## Sum 29 27 18 28 102
(I used addmargins() when tabulating because, in my opinion, the margins make the table more illustrative and ease comparisons.) As seen from the table, the model predicted the highest crime rate category reliably, but the “med_low” category is overrepresented relative to the “low” and “med_high” categories. Thus the model can be used to make crude predictions, but it’s hardly perfect. It might be better to use an unsupervised method and cluster the data instead of classifying it.
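The table can also be condensed into a single accuracy figure; a sketch assuming the correct_classes and lda.pred objects from above (with the table shown here, the diagonal sums to 73 out of 102, i.e. roughly 72% of the test observations were classified correctly):

```r
# Overall share of correctly classified observations.
conf_tab <- table(correct = correct_classes, predicted = lda.pred$class)
accuracy <- sum(diag(conf_tab)) / sum(conf_tab)
accuracy      # Proportion of correct predictions.
1 - accuracy  # Misclassification rate.
```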
To cluster the data, it needs to be loaded and standardised again, and a distance matrix created out of it:
boston_scaled <- as.data.frame(scale(Boston)) # Standardise the data.
dist_eu <- dist(boston_scaled) # Create a Euclidean distance matrix.
summary(dist_eu) # Summarise the matrix.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
We can try to cluster the data with k-means straight away. We used four classes for our LDA model, so let’s start by trying the same number of clusters:
km <- kmeans(dist_eu, centers = 4) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.
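One caveat: kmeans() picks its initial centroids at random, so re-running the chunk above may yield somewhat different clusters. A minimal sketch of making the run reproducible; the seed value is an arbitrary choice:

```r
set.seed(123) # Arbitrary seed; fixes the random initial centroids.
# nstart re-runs the algorithm from several random starts and keeps the
# best solution, making the result less dependent on any single start.
km <- kmeans(dist_eu, centers = 4, nstart = 20)
```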
However, while the results look somewhat reasonable, the number of clusters was merely a guess. To determine it properly, the total within-cluster sum of squares (TWCSS) should be calculated. Let’s try it, with a maximum of 15 clusters:
k_max <- 15 # Maximum number of clusters to try.
# Define a function for testing.
k_try <- function(k) {
  kmeans(dist_eu, k)$tot.withinss
}
# Calculate the total within sum of squares using the function.
twcss <- sapply(1:k_max, k_try)
# Visualize the results.
plot(1:k_max, twcss, type='b')
The optimal number of clusters is where the TWCSS drops radically; however, judging from the above plot, it’s somewhat debatable whether this happens with just two or with four clusters. Thus, for comparison, let’s re-cluster the data with just two clusters:
km <- kmeans(dist_eu, centers = 2) # Cluster the data.
pairs(boston_scaled, col = km$cluster) # Plot the clusters.
As the plots above demonstrate, there seems to be less overlap between the clusters than with four clusters, which suggests that, at least when using Euclidean distance, the optimal number of clusters is indeed two. Because different distance measures can produce different results, I also briefly tested my code by creating the dist_eu variable using the Manhattan distance method instead, but found the results so similar to the Euclidean case that they’re not worth repeating here. The most notable difference was that with the Manhattan method, the TWCSS plot hinted even more strongly that the optimal number of clusters is two.
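The Manhattan-distance comparison described above can be sketched as follows, assuming the boston_scaled and k_max objects from the earlier chunks:

```r
# Re-run the clustering steps with a Manhattan distance matrix.
dist_man <- dist(boston_scaled, method = "manhattan")
twcss_man <- sapply(1:k_max, function(k) kmeans(dist_man, k)$tot.withinss)
plot(1:k_max, twcss_man, type = 'b') # Compare with the Euclidean TWCSS plot.
```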
This chapter has two distinct parts: in the first part, a Principal Component Analysis (PCA) is performed on the combined data of the United Nations’ 2015 Human Development Index (HDI) and Gender Inequality Index (GII). In the second part, a Multiple Correspondence Analysis (MCA) is done for the “tea” dataset included in the FactoMineR R library.
The combined HDI + GII data – henceforth referred to as the Human data – was prepared by joining the two datasets by country, calculating two additional variables from the existing ones, excluding variables that were deemed irrelevant for the analysis, and leaving out any records that were incomplete or that referred to geographic areas other than countries. (For the data wrangling script that was used to prepare the data, see here).
The data is read in as follows:
# Read in the data
human <- as.data.frame(read.table('data/human.csv', sep="\t", header=TRUE))
Glimpse the data to explore its structure and dimensions:
glimpse(human)
## Observations: 155
## Variables: 8
## $ se_f_of_m <dbl> 1.0072389, 0.9968288, 0.9834369, 0.9886128, 0....
## $ lfp_f_of_m <dbl> 0.8908297, 0.8189415, 0.8251001, 0.8840361, 0....
## $ edu_exp <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5, 18.6, 16.5...
## $ life_exp <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9, 80.9, 79.1...
## $ gni_cap <int> 64992, 42261, 56431, 44025, 45435, 43919, 3956...
## $ mmr <int> 4, 6, 6, 5, 6, 7, 9, 28, 11, 8, 6, 4, 8, 4, 27...
## $ abr <dbl> 7.8, 12.1, 1.9, 5.1, 6.2, 3.8, 8.2, 31.0, 14.5...
## $ mp_share <dbl> 39.6, 30.5, 28.5, 38.0, 36.9, 36.9, 19.9, 19.4...
As seen from the output, the dataset has eight variables and includes data for 155 countries. The included variables may be described as follows:
- se_f_of_m: ratio of females to males with at least secondary education
- lfp_f_of_m: ratio of female to male labour force participation rates
- edu_exp: expected years of education
- life_exp: life expectancy at birth
- gni_cap: gross national income (GNI) per capita
- mmr: maternal mortality ratio
- abr: adolescent birth rate
- mp_share: percentage of female representatives in parliament
A graphical overview of the Human data and summaries of its variables can be displayed as follows:
ggpairs(human)
summary(human)
## se_f_of_m lfp_f_of_m edu_exp life_exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## gni_cap mmr abr mp_share
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
As seen from the plots, the distributions of most of the variables are clearly skewed; the most notable exceptions are the gender ratio of population with secondary education and expected years in education, which are reasonably close to the normal distribution. The min and max values and the scatter plots also show that some variables have quite significant outliers, e.g. GNI per capita and maternal mortality rate.
The correlation figures also show that some variables correlate highly with each other. To spot such high correlation more easily, it’s useful to draw a correlation diagram:
corrplot(round(cor(human), digits = 2), method = "circle", type = "upper", tl.pos = "d", tl.cex = 0.8)
The plot demonstrates clearly that the strongest correlation of all is the negative one between life expectancy and maternal mortality rate; from the previous plot we can see that its value is -0.857. The next strongest are the positive correlations between life expectancy and education expectancy (0.789) and between maternal mortality rate and adolescent birth rate (0.759).
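Instead of eyeballing the matrix, the strongest pairs can also be listed numerically; a sketch assuming the human data frame from above:

```r
cors <- cor(human)
cors[upper.tri(cors, diag = TRUE)] <- NA    # Keep each pair only once.
cor_pairs <- as.data.frame(as.table(cors))  # Long format: Var1, Var2, Freq.
cor_pairs <- cor_pairs[!is.na(cor_pairs$Freq), ]
head(cor_pairs[order(-abs(cor_pairs$Freq)), ], 3) # Top 3 by absolute value.
```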
To lower the dimensionality of the data, a principal component analysis (PCA) can be performed for it. This can be done, summarised and plotted as follows, using the singular value decomposition (SVD) method:
pca_human <- prcomp(human)
s_human_nonstd <- summary(pca_human)
s_human_nonstd
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000
## PC7 PC8
## Standard deviation 0.1912 0.1591
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion 1.0000 1.0000
However, as shown by the summary, when done this way the first principal component explains virtually all of the variance, and plotting the result doesn’t make much sense either; R actually throws some warnings (that are deliberately displayed here), because it is unable to draw the plot arrows properly, and the resulting plot is hard to interpret:
# Percentages of variance for the plot titles.
pr_shns <- round(100*s_human_nonstd$importance[2, ], digits = 1)
pc_shns_lab <- paste0(names(pr_shns), " (", pr_shns, "%)")
biplot(pca_human, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_shns_lab[1], ylab = pc_shns_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
## length = arrow.len): zero-length arrow is of indeterminate angle and
## so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
## length = arrow.len): zero-length arrow is of indeterminate angle and
## so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
## length = arrow.len): zero-length arrow is of indeterminate angle and
## so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L],
## length = arrow.len): zero-length arrow is of indeterminate angle and
## so skipped
This is because PCA is sensitive to the relative scaling of the original features: it effectively assumes that features with larger variance are more important than features with smaller variance. Thus, unless the variables are already on comparable scales, it is important to standardise them first – and in this case the scales differed by several orders of magnitude (compare, e.g., GNI per capita with the education ratios). After standardising the variables, the results look very different indeed:
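The scaling problem is easy to verify by comparing the variances of the original variables; a sketch assuming the human data frame from above (GNI per capita dwarfs everything else, which is why the first component of the unscaled PCA swallows nearly all the variance):

```r
# Variances of the unstandardised variables, largest first.
sort(sapply(human, var), decreasing = TRUE)
```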
human_std <- scale(human) # Standardise the variables.
pca_human_std <- prcomp(human_std)
s_human_std <- summary(pca_human_std)
s_human_std
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069
## PC7 PC8
## Standard deviation 0.45900 0.32224
## Proportion of Variance 0.02634 0.01298
## Cumulative Proportion 0.98702 1.00000
# Percentages of variance for the plot titles.
pr_shs <- round(100*s_human_std$importance[2, ], digits = 1)
pc_shs_lab <- paste0(names(pr_shs), " (", pr_shs, "%)")
biplot(pca_human_std, choices = 1:2, cex = c(0.6, 1), col = c("grey40", "deeppink2"), xlab = pc_shs_lab[1], ylab = pc_shs_lab[2])
The results now actually make sense, and as can be read from both the summary of the analysis and from the biplot of it, the first principal component explains 53.6% of the total variance of the original eight variables, and the second principal component explains 16.2% of it.
By inspecting the plot, we can see that the first principal component represents variables related mostly to poverty and the second principal component variables related mostly to equality, and that these two don’t correlate with each other.
We can thus say that poverty explains most of the total variance in the data, but equality also explains some of it. From the plot arrows, we can see that high maternal mortality rate and adolescent birth rate correlate strongly with poverty, and that high life expectancy, high educational expectancy, a high ratio of females with secondary education and high GNI have a strong negative correlation with it. Further, we can see that a high ratio of female MPs and a high ratio of female participation in the labour force have a strong positive correlation with equality, even though equality explains much less of the total variability in the data than poverty.
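This interpretation can be cross-checked numerically from the loading (rotation) matrix of the standardised PCA; a sketch assuming the pca_human_std object from above:

```r
# Loadings of the original variables on the first two principal components.
round(pca_human_std$rotation[, 1:2], 2)
```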
As the final part of Exercise 5, a Multiple Correspondence Analysis (MCA) was done for the “tea” dataset included in the FactoMineR R library. The package description explains the data as follows: ”We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).”
Let’s load the data and glimpse it to explore its structure and dimensions:
data("tea")
glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast <fctr> breakfast, breakfast, Not.breakfast, No...
## $ tea.time <fctr> Not.tea time, Not.tea time, tea time, N...
## $ evening <fctr> Not.evening, Not.evening, evening, Not....
## $ lunch <fctr> Not.lunch, Not.lunch, Not.lunch, Not.lu...
## $ dinner <fctr> Not.dinner, Not.dinner, dinner, dinner,...
## $ always <fctr> Not.always, Not.always, Not.always, Not...
## $ home <fctr> home, home, home, home, home, home, hom...
## $ work <fctr> Not.work, Not.work, work, Not.work, Not...
## $ tearoom <fctr> Not.tearoom, Not.tearoom, Not.tearoom, ...
## $ friends <fctr> Not.friends, Not.friends, friends, Not....
## $ resto <fctr> Not.resto, Not.resto, resto, Not.resto,...
## $ pub <fctr> Not.pub, Not.pub, Not.pub, Not.pub, Not...
## $ Tea <fctr> black, black, Earl Grey, Earl Grey, Ear...
## $ How <fctr> alone, milk, alone, alone, alone, alone...
## $ sugar <fctr> sugar, No.sugar, No.sugar, sugar, No.su...
## $ how <fctr> tea bag, tea bag, tea bag, tea bag, tea...
## $ where <fctr> chain store, chain store, chain store, ...
## $ price <fctr> p_unknown, p_variable, p_variable, p_va...
## $ age <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, ...
## $ sex <fctr> M, F, F, M, M, M, M, F, M, M, M, M, M, ...
## $ SPC <fctr> middle, middle, other worker, student, ...
## $ Sport <fctr> sportsman, sportsman, sportsman, Not.sp...
## $ age_Q <fctr> 35-44, 45-59, 45-59, 15-24, 45-59, 15-2...
## $ frequency <fctr> 1/day, 1/day, +2/day, 1/day, +2/day, 1/...
## $ escape.exoticism <fctr> Not.escape-exoticism, escape-exoticism,...
## $ spirituality <fctr> Not.spirituality, Not.spirituality, Not...
## $ healthy <fctr> healthy, healthy, healthy, healthy, Not...
## $ diuretic <fctr> Not.diuretic, diuretic, diuretic, Not.d...
## $ friendliness <fctr> Not.friendliness, Not.friendliness, fri...
## $ iron.absorption <fctr> Not.iron absorption, Not.iron absorptio...
## $ feminine <fctr> Not.feminine, Not.feminine, Not.feminin...
## $ sophisticated <fctr> Not.sophisticated, Not.sophisticated, N...
## $ slimming <fctr> No.slimming, No.slimming, No.slimming, ...
## $ exciting <fctr> No.exciting, exciting, No.exciting, No....
## $ relaxing <fctr> No.relaxing, No.relaxing, relaxing, rel...
## $ effect.on.health <fctr> No.effect on health, No.effect on healt...
We can see that except for age, all of the variables are categorical – many of them actually binary – and there are 36 of them in total. It’s difficult to visualise or analyse the whole of such a large dataset at once; I actually tried to call ggpairs() on the whole dataset once, but R simply failed to create the temporary image file required to display the plot. Thus, I’m subsetting the data – creating a new dataset called chai in the process – and picking variables that tell what kind of tea people drink, whether they drink it in the evening, and whether they drink it with their friends:
# Etymology note: "tea" is of Fujian, "chai" of Cantonese origin. They
# both mean the same. For details, see http://wals.info/chapter/138
chai <- dplyr::select(tea, one_of(c('Tea','How','sugar','how','evening','friends')))
# Rename some columns for clarity.
names(chai)[1] <- 'type'
names(chai)[2] <- 'extras'
names(chai)[4] <- 'packaging'
glimpse(chai)
## Observations: 300
## Variables: 6
## $ type <fctr> black, black, Earl Grey, Earl Grey, Earl Grey,...
## $ extras <fctr> alone, milk, alone, alone, alone, alone, alone...
## $ sugar <fctr> sugar, No.sugar, No.sugar, sugar, No.sugar, No...
## $ packaging <fctr> tea bag, tea bag, tea bag, tea bag, tea bag, t...
## $ evening <fctr> Not.evening, Not.evening, evening, Not.evening...
## $ friends <fctr> Not.friends, Not.friends, friends, Not.friends...
It’s now more feasible to plot the remaining variables:
ggplot(gather(chai), aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
Then, let’s do an MCA for the data and summarise the model:
chai_mca <- MCA(chai, graph = FALSE)
summary(chai_mca)
##
## Call:
## MCA(X = chai, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.240 0.218 0.196 0.181 0.164 0.161
## % of var. 14.378 13.051 11.757 10.856 9.849 9.643
## Cumulative % of var. 14.378 27.429 39.185 50.041 59.890 69.533
## Dim.7 Dim.8 Dim.9 Dim.10
## Variance 0.150 0.131 0.128 0.099
## % of var. 8.974 7.871 7.652 5.970
## Cumulative % of var. 78.507 86.378 94.030 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.553 0.426 0.235 | -0.309 0.146 0.073 | 0.149
## 2 | 0.936 1.218 0.481 | -0.238 0.087 0.031 | 0.587
## 3 | -0.291 0.117 0.097 | 0.124 0.024 0.018 | -0.214
## 4 | 0.122 0.021 0.017 | -0.720 0.794 0.583 | -0.090
## 5 | 0.141 0.027 0.018 | -0.163 0.041 0.024 | -0.280
## 6 | 0.466 0.302 0.250 | -0.410 0.258 0.194 | -0.133
## 7 | 0.035 0.002 0.002 | -0.123 0.023 0.024 | -0.067
## 8 | 0.611 0.519 0.182 | 0.009 0.000 0.000 | 0.439
## 9 | 0.370 0.191 0.083 | -0.204 0.064 0.025 | 0.343
## 10 | 0.437 0.266 0.109 | 0.693 0.735 0.274 | -0.046
## ctr cos2
## 1 0.038 0.017 |
## 2 0.586 0.189 |
## 3 0.078 0.053 |
## 4 0.014 0.009 |
## 5 0.134 0.072 |
## 6 0.030 0.020 |
## 7 0.008 0.007 |
## 8 0.328 0.094 |
## 9 0.201 0.072 |
## 10 0.004 0.001 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.802 11.027 0.210 7.933 | 0.857 13.878 0.240
## Earl Grey | -0.465 9.667 0.390 -10.794 | -0.292 4.209 0.154
## green | 0.921 6.483 0.105 5.596 | -0.213 0.381 0.006
## alone | 0.097 0.422 0.017 2.277 | 0.072 0.256 0.010
## lemon | -1.157 10.244 0.165 -7.034 | 0.274 0.633 0.009
## milk | 0.211 0.649 0.012 1.879 | -0.596 5.717 0.094
## other | 0.674 0.946 0.014 2.048 | 1.614 5.992 0.081
## No.sugar | 0.488 8.545 0.254 8.718 | 0.419 6.946 0.188
## sugar | -0.521 9.135 0.254 -8.718 | -0.448 7.425 0.188
## tea bag | 0.093 0.344 0.011 1.847 | -0.584 14.813 0.446
## v.test Dim.3 ctr cos2 v.test
## black 8.479 | 0.619 8.040 0.125 6.125 |
## Earl Grey -6.786 | -0.016 0.014 0.000 -0.371 |
## green -1.292 | -1.295 15.684 0.207 -7.871 |
## alone 1.689 | -0.464 11.897 0.400 -10.931 |
## lemon 1.666 | 0.213 0.423 0.006 1.292 |
## milk -5.314 | 0.813 11.801 0.176 7.247 |
## other 4.910 | 3.582 32.731 0.397 10.891 |
## No.sugar 7.489 | -0.055 0.134 0.003 -0.987 |
## sugar -7.489 | 0.059 0.143 0.003 0.987 |
## tea bag -11.550 | 0.162 1.271 0.034 3.211 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## type | 0.391 0.241 0.279 |
## extras | 0.176 0.164 0.668 |
## sugar | 0.254 0.188 0.003 |
## packaging | 0.048 0.458 0.183 |
## evening | 0.206 0.108 0.035 |
## friends | 0.363 0.146 0.007 |
What can be seen from the model summary is that none of the dimensions retains a large percentage of the total variance, and that there are no strong links between the dimensions and the categorical variables, even though some of the categories do have coordinate values significantly different from zero. Drawing a variable biplot of the analysis, together with the individuals and categories plots, confirms these findings:
par(mfrow = c(1,3)) # Set some graphical params.
plot(chai_mca, choix = "var", title = "MCA variables") # The variable biplot.
plot(chai_mca, choix = "ind", invisible = "var") # The individuals plot.
plot(chai_mca, choix = "ind", invisible = "ind") # The categories plot.
As can be seen from the plot on the left, none of the variables are very strongly linked to either of the dimensions. The strongest link is the packaging variable’s – i.e. whether the person prefers to drink loose tea, teabags or both – link to the second dimension. The rest of the links are quite weak.
The plots on the center and on the right represent the individuals and categories, respectively, and demonstrate that there is no clear pattern in either of them.
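For completeness, the eigenvalue table from the summary can also be visualised as a simple scree plot; a sketch assuming the chai_mca object from above (chai_mca$eig is the table that FactoMineR stores in the fitted model):

```r
# Bar plot of the percentage of variance retained by each MCA dimension.
barplot(chai_mca$eig[, "percentage of variance"],
        names.arg = rownames(chai_mca$eig),
        las = 2, ylab = "% of variance")
```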